Recording word position information for improved document categorization
نویسندگان
چکیده
In this paper, which is a report from work in progress, we briefly present the new document representation that could be used in classic text mining applications, such as document categorization. We briefly present most popular unigram and n-gram document representations that are used frequently in text mining research, mention their shortcomings, and present the idea of representation that records not only word count information, but also word position within document. As experiments are still underway, we do not present final result, but only mention types of tests that are being done.
منابع مشابه
Text Categorization
Text categorization is the task of assigning predefined categories to natural language text. With the widely used “bag-ofword” representation, previous researches usually assign a word with values that express whether this word appears in the document concerned or how frequently this word appears. Although these values are useful for text categorization, they have not fully expressed the abunda...
متن کاملA New Document Embedding Method for News Classification
Abstract- Text classification is one of the main tasks of natural language processing (NLP). In this task, documents are classified into pre-defined categories. There is lots of news spreading on the web. A text classifier can categorize news automatically and this facilitates and accelerates access to the news. The first step in text classification is to represent documents in a suitable way t...
متن کاملA Sociological Definition and Categorization of Information Ethics
Background and Aim: This paper aims at the analysis of the definitions and categorizations of the realm of “Information Ethics” to criticize assumptions and clarify points of departure for introducing a new definition and categorization. Method: I used documentary research method and conceptual analysis approach. This method and approach is the best fits with the goal of pursuit roots of social...
متن کاملLearning with Taxonomies: Classifying Documents and Words
Automatically extracting semantic information about word meaning and document topic from text typically involves an extensive number of classes. Such classes may represent predefined word senses, topics or document categories and are often organized in a taxonomy. The latter encodes important information, which should be exploited in learning classifiers from labeled training data. To that exte...
متن کاملFeature Generation for Text Categorization Using World Knowledge
We enhance machine learning algorithms for text categorization with generated features based on domain-specific and common-sense knowledge. This knowledge is represented using publicly available ontologies that contain hundreds of thousands of concepts, such as the Open Directory; these ontologies are further enriched by several orders of magnitude through controlled Web crawling. Prior to text...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2002